Abstract

We bound the future loss when predicting any (computably) stochastic
sequence online. Solomonoff finitely bounded the total deviation of his
universal predictor $M$ from the true distribution $\mu$ by the algorithmic
complexity of $\mu$. Here we assume we are at a time $t>1$ and have already
observed $x=x_1...x_t$. We bound the future prediction performance on
$x_{t+1}x_{t+2}...$ by a new variant of algorithmic complexity of $\mu$ given
$x$, plus the complexity of the randomness deficiency of $x$. The new
complexity is monotone in its condition in the sense that this complexity can
only decrease if the condition is prolonged. We also briefly discuss potential
generalizations to Bayesian model classes and to classification problems.
1 Introduction

We consider the problem of online=sequential predictions. We assume that
the sequences $x=x_1x_2x_3...$ are drawn from some "true" but unknown
probability distribution $\mu$. Bayesians proceed by considering a class
$\mathcal{M}$ of models=hypotheses=distributions, sufficiently large such that
$\mu\in\mathcal{M}$, and a prior over $\mathcal{M}$. Solomonoff considered the
truly large class that contains all computable probability distributions
[Sol64]. He showed that his universal distribution $M$ converges rapidly to
$\mu$ [Sol78], i.e. predicts well in any environment as long as it is computable
or can be modeled by a computable probability distribution (all physical
theories are of this sort). $M(x)$ is roughly $2^{-K(x)}$, where $K(x)$ is the
length of the shortest description of $x$, called the Kolmogorov complexity of
$x$. Since $K$ and $M$ are incomputable, they have to be approximated in
practice. See e.g. [Sch02b, Hut04, LV97, CV05] and references therein. The
universality of $M$ also precludes useful statements of the prediction quality
at particular time instances $n$ [Hut04, p62], as opposed to simple classes like
i.i.d. sequences (data) of size $n$, where accuracy is typically $O(n^{-1/2})$.
Luckily, bounds on the expected total=cumulative loss (e.g. number of prediction
errors) for $M$ can be derived [Sol78, Hut03a, Hut03b], which is often
sufficient in an online setting. The bounds are in terms of the (Kolmogorov)
complexity of $\mu$. For instance, for deterministic $\mu$, the number of errors
is (in a sense tightly) bounded by $K(\mu)$, which measures in this case the
information (in bits) in the observed infinite sequence $x$.

What's new. In this paper we assume we are at a time $t>1$ and have already
observed $x=x_1...x_t$. Hence we are interested in the future prediction
performance on $x_{t+1}x_{t+2}...$, since typically we don't care about past
errors. If the total loss is finite, the future loss must necessarily be small
for large $t$. In a sense the paper intends to quantify this apparent
triviality. If the complexity of $\mu$ bounds the total loss, a natural guess is
that something like the conditional complexity of $\mu$ given $x$ bounds the
future loss. (If $x$ contains a lot of (or even all) information about $\mu$, we
should make fewer (no) errors anymore.) Indeed, we prove two bounds of this kind
but with additional terms describing structural properties of $x$. These
additional terms appear since the total loss is bounded only in expectation, and
hence the future loss is small only for "most" $x_1...x_t$. In the first bound
(Theorem 1), the additional term is the complexity of the length of $x$ (a kind
of worst-case estimation). The second bound (Theorem 7) is finer: the additional
term is the complexity of the randomness deficiency of $x$. The advantage is
that the deficiency is small for "typical" $x$ and bounded on average (in
contrast to the length). But in this case the conventional conditional
complexity turned out to be unsuitable. So we introduce a new natural
modification of conditional Kolmogorov complexity, which is monotone as a
function of the condition. Informally speaking, we require programs
(=descriptions) to be consistent in the sense that if a program generates some
$\mu$ given $x$, then it must generate the same $\mu$ given any prolongation of
$x$. The new posterior bounds also significantly improve the previous total
bounds.

Contents. The paper is organized as follows. Some basic notation and definitions
are given in Sections 2 and 3. In Section 4 we prove and discuss the length-based
bound Theorem 1. In Section 5 we show why a new definition of complexity is
necessary and formulate the deficiency-based bound Theorem 7. We discuss the
definition and basic properties of the new complexity in Section 6, and prove
Theorem 7 in Section 7. We briefly discuss potential generalizations to general
model classes $\mathcal{M}$ and classification in the concluding Section 8.

2 Notation & Definitions 

We essentially follow the notation of [LV97, Hut04]. 

Strings and natural numbers. We write $\mathcal{X}^*$ for the set of finite
strings over a finite alphabet $\mathcal{X}$, and $\mathcal{X}^\infty$ for the
set of infinite sequences. The cardinality of a set $S$ is denoted by $|S|$. We
use letters $i,k,l,n,t$ for natural numbers, $u,v,x,y,z$ for finite strings,
$\epsilon$ for the empty string, and $\alpha=\alpha_{1:\infty}$ etc. for
infinite sequences. For a string $x$ of length $\ell(x)=n$ we write
$x_1x_2...x_n$ with $x_t\in\mathcal{X}$ and further abbreviate
$x_{k:n}:=x_kx_{k+1}...x_{n-1}x_n$ and $x_{<n}:=x_1...x_{n-1}$. For
$x_t\in\mathcal{X}$, denote by $\bar x_t$ an arbitrary element of $\mathcal{X}$
such that $\bar x_t\neq x_t$. For the binary alphabet $\mathcal{X}=\{0,1\}$,
$\bar x_t$ is uniquely defined. We occasionally identify strings with natural
numbers.

Prefix sets. A string $x$ is called a (proper) prefix of $y$ if there is a
$z$ ($\neq\epsilon$) such that $xz=y$; $y$ is called a prolongation of $x$. We
write $x*=y$ in this case, where $*$ is a wildcard for a string, and similarly
for infinite sequences. A set of strings is called prefix free if no element is
a proper prefix of another. Any prefix-free set $\mathcal{P}$ has the important
property of satisfying Kraft's inequality
$\sum_{x\in\mathcal{P}}|\mathcal{X}|^{-\ell(x)}\le 1$.
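
As a small illustration of the prefix-free property and Kraft's inequality, the following sketch checks both for two hand-picked binary string sets. The helper names and the example sets are hypothetical and not part of the paper.

```python
from itertools import combinations

def is_prefix_free(strings):
    """True if no string in the collection is a proper prefix of another."""
    return not any(a != b and (a.startswith(b) or b.startswith(a))
                   for a, b in combinations(strings, 2))

def kraft_sum(strings, alphabet_size):
    """Sum of |X|^(-l(x)) over the given strings."""
    return sum(alphabet_size ** (-len(s)) for s in strings)

# Illustrative sets over the binary alphabet (assumed examples).
prefix_free = ["0", "10", "110", "111"]
not_prefix_free = ["0", "01", "1"]      # "0" is a proper prefix of "01"

print(is_prefix_free(prefix_free), kraft_sum(prefix_free, 2))
print(is_prefix_free(not_prefix_free), kraft_sum(not_prefix_free, 2))
```

The second set violates prefix-freeness, and its Kraft sum exceeds 1, which a prefix-free set can never do.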

Asymptotic notation. We write $f(x)\stackrel{\times}{\le}g(x)$ for
$f(x)=O(g(x))$ and $f(x)\stackrel{+}{\le}g(x)$ for $f(x)\le g(x)+O(1)$.
Equalities $\stackrel{\times}{=}$, $\stackrel{+}{=}$ are defined similarly: they
hold if the corresponding inequalities hold in both directions.

(Semi)measures. We call $\rho:\mathcal{X}^*\to[0,1]$ a (semi)measure iff
$\sum_{x_n\in\mathcal{X}}\rho(x_{1:n})\stackrel{(<)}{=}\rho(x_{<n})$ and
$\rho(\epsilon)\stackrel{(<)}{=}1$. $\rho(x)$ is interpreted as the
$\rho$-probability of sampling a sequence which starts with $x$. The conditional
probability (posterior) $\rho(y|x):=\frac{\rho(xy)}{\rho(x)}$ is the
$\rho$-probability that a string $x$ is followed by (continued with) $y$. We
call $\rho$ deterministic if $\exists\alpha:\rho(\alpha_{1:n})=1\ \forall n$. In
this case we identify $\rho$ with $\alpha$.

Random events and expectations. We assume that the sequence
$\omega=\omega_{1:\infty}$ is sampled from the "true" measure $\mu$, i.e.
$\mathbf{P}[\omega_{1:n}=x_{1:n}]=\mu(x_{1:n})$. We denote expectations w.r.t.
$\mu$ by $\mathbf{E}$, i.e. for a function $f:\mathcal{X}^n\to\mathbb{R}$,
$\mathbf{E}[f]=\mathbf{E}[f(\omega_{1:n})]=\sum_{x_{1:n}}\mu(x_{1:n})f(x_{1:n})$.
We abbreviate $\mu_t:=\mu(x_t|\omega_{<t})$.

Enumerable sets and functions. A set of strings (or naturals, or other
constructive objects) is called enumerable if it is the range of some computable
function. A function $f:\mathcal{X}^*\to\mathbb{R}$ is called enumerable
(co-enumerable) if the set of pairs $\{\langle x,\frac{k}{n}\rangle \mid
f(x)>\frac{k}{n}\}$ (respectively $f(x)<\frac{k}{n}$) is enumerable. A measure
$\mu$ is called computable if it is enumerable and co-enumerable and the set
$\{x \mid \mu(x)=0\}$ is decidable (i.e. enumerable and co-enumerable).
Prefix Kolmogorov complexity. The conditional prefix complexity
$K(y|x):=\min\{\ell(p):U(p,x)=y\}$ is the length of the shortest binary
(self-delimiting) program $p\in\{0,1\}^*$ on a universal prefix Turing machine
$U$ with output $y\in\mathcal{X}^*$ and input $x\in\mathcal{X}^*$ [LV97].
$K(x):=K(x|\epsilon)$. For non-string objects $o$ we define
$K(o):=K(\langle o\rangle)$, where $\langle o\rangle\in\mathcal{X}^*$ is some
standard code for $o$. In particular, if $(f_i)_{i=1}^\infty$ is an enumeration
of all (co-)enumerable functions, we define $K(f_i):=K(i)$. We need the
following properties: the co-enumerability of $K$, the upper bounds
$K(x|\ell(x))\stackrel{+}{\le}\ell(x)\log_2|\mathcal{X}|$ and
$K(n)\stackrel{+}{\le}2\log_2 n$, Kraft's inequality $\sum_x 2^{-K(x)}\le 1$,
the lower bound $K(x)\ge\ell(x)$ for "most" $x$ (which implies
$K(n)\to\infty$ for $n\to\infty$), extra information bounds
$K(x|y)\stackrel{+}{\le}K(x)\stackrel{+}{\le}K(x,y)$, subadditivity
$K(xy)\stackrel{+}{\le}K(x,y)\stackrel{+}{\le}K(y)+K(x|y)$, information
non-increase $K(f(x))\stackrel{+}{\le}K(x)+K(f)$ for computable
$f:\mathcal{X}^*\to\mathcal{X}^*$, and coding relative to a probability
distribution (MDL): if $P:\mathcal{X}^*\to[0,1]$ is enumerable and
$\sum_x P(x)\le 1$, then $K(x)\stackrel{+}{\le}-\log_2 P(x)+K(P)$.

Monotone and Solomonoff complexity. The monotone complexity
$Km(x):=\min\{\ell(p):U(p)=x*\}$ is the length of the shortest binary (possibly
non-halting) program $p\in\{0,1\}^*$ on a universal monotone Turing machine $U$
which outputs a string starting with $x$. Solomonoff's prior
$M(x):=\sum_{p:U(p)=x*}2^{-\ell(p)}=:2^{-KM(x)}$ is the probability that $U$
outputs a string starting with $x$ if provided with fair coin flips on the input
tape. Most complexities coincide within an additive term $O(\log\ell(x))$, e.g.
$K(x|\ell(x))\stackrel{+}{\le}KM(x)\le Km(x)\stackrel{+}{\le}K(x)$, hence
similar relations as for $K$ hold.

3 Setup 

Convergent predictors. We assume that $\mu$ is a "true"^1 sequence generating
measure, also called environment. If we know the generating process $\mu$, and
given past data $x_{<t}$, we can predict the probability $\mu(x_t|x_{<t})$ of
the next data item $x_t$. Usually we do not know $\mu$, but estimate it from
$x_{<t}$. Let $\rho(x_t|x_{<t})$ be an estimated probability^2 of $x_t$, given
$x_{<t}$. Closeness of $\rho(x_t|x_{<t})$ to $\mu(x_t|x_{<t})$ is desirable as a
goal in itself or when performing a Bayes decision $y_t$ that has minimal
$\rho$-expected loss
$l_t^\rho(x_{<t}):=\min_{y_t}\sum_{x_t}\mathrm{Loss}(x_t,y_t)\rho(x_t|x_{<t})$.
Consider, for instance, a weather data sequence $x_{1:n}$ with $x_t=1$ meaning
rain and $x_t=0$ meaning sun at day $t$. Given $x_{<t}$ the probability of rain
tomorrow is $\mu(1|x_{<t})$. A weather forecaster may announce the probability
of rain to be $y_t:=\rho(1|x_{<t})$, which should be close to the true
probability $\mu(1|x_{<t})$. To aim for

$$\rho(x'_t|x_{<t}) \longrightarrow \mu(x'_t|x_{<t}) \quad\text{for}\quad t\to\infty$$

seems reasonable.

^1 Also called objective or aleatory probability or chance.
^2 Also called subjective or belief or epistemic probability.

Convergence in mean sum. We can quantify the deviation of $\rho_t$ from
$\mu_t$, e.g. by the squared difference

$$s_t(\omega_{<t}) := \sum_{x_t\in\mathcal{X}}\big(\rho(x_t|\omega_{<t})-\mu(x_t|\omega_{<t})\big)^2 \equiv \sum_{x_t}(\rho_t-\mu_t)^2 .$$

Alternatively one may also use the squared absolute distance
$s_t:=\frac12\big(\sum_{x_t}|\rho_t-\mu_t|\big)^2$, the Hellinger distance
$s_t:=\sum_{x_t}(\sqrt{\rho_t}-\sqrt{\mu_t})^2$, the KL-divergence
$s_t:=\sum_{x_t}\mu_t\ln\frac{\mu_t}{\rho_t}$, or the squared Bayes regret
$s_t:=\frac12(l_t^\rho-l_t^\mu)^2$ for $l_t\in[0,1]$. For all these distances
one can show [Hut03a, Hut04] that their cumulative expectation from $l$ to $n$
is bounded as follows:

$$0 \le \sum_{t=l}^n\mathbf{E}[s_t|\omega_{<l}] \le \mathbf{E}\Big[\ln\frac{\mu(\omega_{l:n}|\omega_{<l})}{\rho(\omega_{l:n}|\omega_{<l})}\,\Big|\,\omega_{<l}\Big] =: D_{l:n}(\omega_{<l}). \qquad (1)$$

$D_{l:n}$ is increasing in $n$, hence $D_{l:\infty}\in[0,\infty]$ exists
[Hut01, Hut04]. A sequence of random variables like $s_t$ is said to converge to
zero with probability 1 if the set $\{\omega : s_t(\omega)\to 0\}$ has measure
1. $s_t$ is said to converge to zero in mean sum if
$\sum_{t=1}^\infty\mathbf{E}[|s_t|]\le c<\infty$, which implies convergence with
probability 1 (rapid if $c$ is of reasonable size). Therefore a small finite
bound on $D_{1:\infty}$ would imply rapid convergence of the $s_t$ defined above
to zero, hence $\rho_t\to\mu_t$ and $l_t^\rho\to l_t^\mu$ fast. So the crucial
quantities to consider and bound (in expectation) are $\ln\frac{\mu(x)}{\rho(x)}$
if $l=1$ and $\ln\frac{\mu(y|x)}{\rho(y|x)}$ for $l>1$. For illustration we will
sometimes loosely interpret $D_{1:\infty}$ and other quantities as the number of
prediction errors, as for the error-loss they are closely related to it [Hut01].
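
To make the instantaneous distances concrete, here is a minimal sketch that evaluates them for one fixed pair of next-symbol distributions. The two distributions are arbitrary illustrative choices, and the factor 1/2 in the squared absolute distance is the normalization under which Pinsker's inequality bounds it by the KL-divergence. For this pair, each of the other distances is dominated by the KL term, in line with (1).

```python
import math

# Hypothetical next-symbol distributions over a three-letter alphabet.
mu  = [0.7, 0.2, 0.1]   # "true" conditional probabilities mu_t
rho = [0.5, 0.3, 0.2]   # predictor's conditional probabilities rho_t

squared   = sum((r - m) ** 2 for r, m in zip(rho, mu))
absolute  = 0.5 * sum(abs(r - m) for r, m in zip(rho, mu)) ** 2
hellinger = sum((math.sqrt(r) - math.sqrt(m)) ** 2 for r, m in zip(rho, mu))
kl        = sum(m * math.log(m / r) for r, m in zip(rho, mu))

for name, value in [("squared", squared), ("absolute", absolute),
                    ("hellinger", hellinger), ("kl", kl)]:
    print(f"{name:10s} {value:.4f}")
```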

Bayes mixtures. A Bayesian considers a class of distributions
$\mathcal{M}:=\{\nu_1,\nu_2,...\}$, large enough to contain $\mu$, and uses the
Bayes mixture

$$\xi(x) := \sum_{\nu\in\mathcal{M}}w_\nu\cdot\nu(x), \qquad \sum_{\nu\in\mathcal{M}}w_\nu = 1, \qquad w_\nu>0 \qquad (2)$$

for prediction, where $w_\nu$ can be interpreted as the prior of (or initial
belief in) $\nu$. The dominance

$$\xi(x) \ge w_\mu\cdot\mu(x) \qquad \forall x\in\mathcal{X}^* \qquad (3)$$

is its most important property. Using $\rho=\xi$ for prediction, this implies
$D_{1:\infty}\le\ln w_\mu^{-1}<\infty$, hence $\xi_t\to\mu_t$. If $\mathcal{M}$
is chosen sufficiently large, then $\mu\in\mathcal{M}$ is not a serious
constraint.
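
The following sketch illustrates the mixture (2), the dominance property (3), and the resulting bound $D_{1:n}\le\ln w_\mu^{-1}$ on a toy model class. The class of three Bernoulli environments, the uniform prior, and the choice of the true environment are assumptions made purely for illustration; the expectation is computed by exhaustive enumeration of all binary strings of length $n$.

```python
import math
from itertools import product

thetas = [0.1, 0.5, 0.9]      # hypothetical class of Bernoulli(theta) environments
weights = [1 / 3, 1 / 3, 1 / 3]   # prior weights w_nu (assumed uniform)
true_index = 2                # assume the true environment is Bernoulli(0.9)

def nu(theta, x):
    """Probability that a Bernoulli(theta) source emits the bit tuple x."""
    ones = sum(x)
    return theta ** ones * (1 - theta) ** (len(x) - ones)

def xi(x):
    """Bayes mixture (2) over the toy class."""
    return sum(w * nu(th, x) for w, th in zip(weights, thetas))

def total_divergence(n):
    """D_{1:n} = E[ln(mu(w_{1:n}) / xi(w_{1:n}))], summed over all x of length n."""
    th = thetas[true_index]
    return sum(nu(th, x) * math.log(nu(th, x) / xi(x))
               for x in product((0, 1), repeat=n))

n = 10
bound = math.log(1 / weights[true_index])          # ln(1 / w_mu)
dominance_holds = all(
    xi(x) >= weights[true_index] * nu(thetas[true_index], x)
    for k in range(n + 1) for x in product((0, 1), repeat=k))

print(dominance_holds, total_divergence(n), bound)
```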

Solomonoff prior. So we consider the largest (from a computational point of
view) relevant class, the class $\mathcal{M}_U$ of all enumerable semimeasures
(which includes all computable probability distributions), and choose
$w_\nu=2^{-K(\nu)}$, which is biased towards simple environments (Occam's
razor). This gives us Solomonoff-Levin's prior $M$ [Sol64, ZL70] (this
definition coincides within an irrelevant multiplicative constant with the one
in Section 2). In the following we assume $\mathcal{M}=\mathcal{M}_U$,
$\rho=\xi=M$, $w_\nu=2^{-K(\nu)}$ and $\mu\in\mathcal{M}_U$ being a computable
(proper) measure, hence $M(x)\ge 2^{-K(\mu)}\mu(x)\ \forall x$ by (3).

Prediction of deterministic environments. Consider a computable sequence
$\alpha=\alpha_{1:\infty}$ "sampled from $\mu\in\mathcal{M}$" with
$\mu(\alpha)=1$, i.e. $\mu$ is deterministic; then from (3) we get

$$\sum_{t=1}^\infty|1-M(\alpha_t|\alpha_{<t})| \le -\sum_{t=1}^\infty\ln M(\alpha_t|\alpha_{<t}) = -\ln M(\alpha_{1:\infty}) \le K(\mu)\ln 2 < \infty, \qquad (4)$$

which implies that $M(\alpha_t|\alpha_{<t})$ converges rapidly to 1 and hence
$M(\bar\alpha_t|\alpha_{<t})\to 0$, i.e. asymptotically $M$ correctly predicts
the next symbol. The number of prediction errors is of the order of the
complexity $K(\mu)\stackrel{+}{=}Km(\alpha)$ of the sequence.
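
The chain of inequalities in (4) can be checked on a finite stand-in for the universal class. The sketch below replaces $M$ by a Bayes mixture over a handful of periodic binary sequences with a uniform prior (both are assumptions for illustration, since $M$ itself is incomputable) and verifies that the cumulative deviation never exceeds $\ln w_\mu^{-1}$.

```python
import math

# Hypothetical finite class of deterministic environments: periodic sequences.
patterns = ["0", "1", "01", "001", "011", "0011"]
weights = [1 / len(patterns)] * len(patterns)   # uniform prior (an assumption)

def symbol(pattern, t):
    """t-th symbol (0-indexed) of the infinite repetition of pattern."""
    return pattern[t % len(pattern)]

def cumulative_deviation(true_index, horizon):
    """Sum over t < horizon of 1 - xi(alpha_t | alpha_<t) for the class mixture xi."""
    alive = list(range(len(patterns)))   # environments consistent with the past
    total = 0.0
    for t in range(horizon):
        a_t = symbol(patterns[true_index], t)
        mass = sum(weights[i] for i in alive)
        agree = [i for i in alive if symbol(patterns[i], t) == a_t]
        total += 1.0 - sum(weights[i] for i in agree) / mass
        alive = agree
    return total

bound = math.log(len(patterns))          # ln(1 / w_mu) for the uniform prior
for i, pattern in enumerate(patterns):
    print(pattern, round(cumulative_deviation(i, 50), 4), "<=", round(bound, 4))
```

For a deterministic class the mixture probability of a string is just the prior mass of the environments still consistent with it, which is why the code only tracks the surviving indices.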

For binary alphabet this is the best we can expect, since at each time-step only
a single bit can be learned about the environment, and only after we "know" the
environment we can predict correctly. For non-binary alphabet, $K(\mu)$ still
measures the information in $\mu$ in bits, but feedback per step can now be
$\log_2|\mathcal{X}|$ bits, so we may expect a better bound
$K(\mu)/\log_2|\mathcal{X}|$. But in the worst case all
$\alpha_t\in\{0,1\}\subseteq\mathcal{X}$. So without structural assumptions on
$\mu$ the bound cannot be improved even if $\mathcal{X}$ is huge. We will see
how our posterior bounds can help in this situation.

Individual randomness (deficiency). Let us now consider a general (not
necessarily deterministic) computable measure $\mu\in\mathcal{M}$. The
Shannon-Fano code of $x$ w.r.t. $\mu$ has code-length
$\lceil-\log_2\mu(x)\rceil$, which is "optimal" for "typical/random" $x$ sampled
from $\mu$. Further, $-\log_2 M(x)\approx K(x)$ is the length of an "optimal"
code for $x$. Hence $-\log_2\mu(x)\approx-\log_2 M(x)$ for
"$\mu$-typical/random" $x$. This motivates the definition of $\mu$-randomness
deficiency

$$d_\mu(x) := \log_2\frac{M(x)}{\mu(x)},$$

which is small for "typical/random" $x$. Formally, a sequence $\alpha$ is called
(Martin-Löf) random iff $d_\mu(\alpha):=\sup_n d_\mu(\alpha_{1:n})<\infty$, i.e.
iff its Shannon-Fano code is "optimal" (note that
$d_\mu(\alpha)\ge-K(\mu)>-\infty$ for all sequences), i.e. iff

$$\sup_n\Big|\sum_{t=1}^n\log\frac{\mu(\alpha_t|\alpha_{<t})}{M(\alpha_t|\alpha_{<t})}\Big| \equiv \sup_n\Big|\log\frac{\mu(\alpha_{1:n})}{M(\alpha_{1:n})}\Big| < \infty .$$

Unfortunately this does not imply $M_t\to\mu_t$ on the $\mu$-random $\alpha$,
since $M_t$ may oscillate around $\mu_t$, which indeed can happen [HM04]. But if
we take the expectation, Solomonoff [Sol78, Hut01, Hut04] showed

$$\sum_{t=1}^\infty\mathbf{E}\sum_{x_t}(M_t-\mu_t)^2 \le D_{1:\infty} = \lim_{n\to\infty}\mathbf{E}[-d_\mu(\omega_{1:n})]\ln 2 \le K(\mu)\ln 2 < \infty, \qquad (5)$$

hence $M_t\to\mu_t$ with $\mu$-probability 1. So in any case, $d_\mu(x)$ is an
important quantity, since the smaller $-d_\mu(x)$ (at least in expectation) the
better $M$ predicts.



4 Posterior Bounds 



Posterior bounds. Both bounds (4) and (5) bound the total (cumulative)
discrepancy (error) between $M_t$ and $\mu_t$. Since the discrepancy sum
$D_{1:\infty}$ is finite, we know that after a sufficiently long time $t=l$, we
will make few further errors, i.e. the future error sum $D_{l:\infty}$ is small.
The main goal of this paper is to quantify this asymptotic statement. So we need
bounds on $\log_2\frac{\mu(y|x)}{M(y|x)}$, where $x$ are past and $y$ are future
observations. Since $\log_2\frac{\mu(y)}{M(y)}\stackrel{+}{\le}K(\mu)$ and
$\mu(y|x)/M(y|x)$ are conditional versions of true/universal distributions, it
seems natural that the unconditional bound $K(\mu)$ also simply conditionalizes
to $\log_2\frac{\mu(y|x)}{M(y|x)}\stackrel{+}{\le}K(\mu|x)$. The more
information the past observation $x$ contains about $\mu$, the easier it is to
code $\mu$, i.e. the smaller is $K(\mu|x)$, and hence the fewer future
prediction errors $D_{l:\infty}$ we should make. Once $x$ contains all
information about $\mu$, i.e. $K(\mu|x)\stackrel{+}{=}0$, we should make no
errors anymore. More formally, optimally coding $x$, then $\mu|x$, and finally
$y|\mu,x$ by Shannon-Fano, gives a code for $xy$, hence
$K(xy)\lesssim K(x)+K(\mu|x)+\log_2\mu(y|x)^{-1}$. Since
$K(z)\approx-\log_2 M(z)$, this implies
$\log_2\frac{\mu(y|x)}{M(y|x)}\lesssim K(\mu|x)$, but with a logarithmic fudge
that tends to infinity for $\ell(y)\to\infty$, which is unacceptable. The
$y$-independent bound we need was first stated in [Hut04, Prob.2.6(iii)]:

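Theorem 1. For any computable measure $\mu$ and any $x,y\in\mathcal{X}^*$ it
holds

$$\log_2\frac{\mu(y|x)}{M(y|x)} \stackrel{+}{\le} K(\mu|x) + K(\ell(x)).$$

Proof. For any fixed $l$ we define the following function of
$z\in\mathcal{X}^*$. For $\ell(z)\ge l$,

$$\psi_l(z) := \sum_{\nu\in\mathcal{M}}2^{-K(\nu|z_{1:l})}M(z_{1:l})\,\nu(z_{l+1:\ell(z)}).$$

For $\ell(z)<l$ we extend $\psi_l$ by defining
$\psi_l(z):=\sum_{u:\ell(u)=l-\ell(z)}\psi_l(zu)$. It is easy to see that
$\psi_l$ is an enumerable semimeasure. By definition of $M$, we have
$M(z)\ge 2^{-K(\psi_l)}\psi_l(z)$ for any $l$ and $z$. Now let $l=\ell(x)$ and
$z=xy$. Let us define a semimeasure $\mu_x(y):=\mu(y|x)$. Then, keeping only the
term $\nu=\mu_x$ of the sum,

$$M(xy) \ge 2^{-K(\psi_l)}\psi_l(xy) \ge 2^{-K(\psi_l)}2^{-K(\mu_x|x)}M(x)\mu_x(y).$$

Taking the logarithm, after trivial transformations, we get
$\log_2\frac{\mu(y|x)}{M(y|x)}\le K(\mu_x|x)+K(\psi_l)$. To complete the proof,
let us note that $K(\psi_l)\stackrel{+}{\le}K(l)$ and
$K(\mu_x|x)\stackrel{+}{\le}K(\mu|x)$.

□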

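Corollary 2. The future and total deviations of $M_t$ from $\mu_t$ are bounded
by

i) $\sum_{t=l+1}^\infty\mathbf{E}[s_t|\omega_{1:l}] \le D_{l+1:\infty}(\omega_{1:l}) \stackrel{+}{\le} (K(\mu|\omega_{1:l})+K(l))\ln 2$
ii) $\sum_{t=1}^\infty\mathbf{E}[s_t] \stackrel{+}{\le} \min_l\{\mathbf{E}[K(\mu|\omega_{1:l})+K(l)]\ln 2+2l\}$

Proof. (i) The first inequality is (1) and the second follows by taking the
conditional expectation $\mathbf{E}[\cdot|\omega_{1:l}]$ in Theorem 1. (ii)
follows from (i) by taking the unconditional expectation and from
$\sum_{t=1}^l\mathbf{E}[s_t]\le 2l$, since $s_t\le 2$. □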



Examples and more motivation. The bounds Theorem 1 and Corollary 2(i)
prove and quantify the intuition that the more we know about the environment,
the better our predictions. We show the usefulness of the new bounds for some
deterministic environments $\mu=\alpha$.

Assume all observations are identical, i.e. $\alpha=x_1x_1x_1...$. Further
assume that $\mathcal{X}$ is huge and
$K(x_1)\stackrel{+}{=}\log_2|\mathcal{X}|$, i.e. $x_1$ is a
typical/random/complex element of $\mathcal{X}$. For instance if $x_1$ is a
$256^3$ color $512\times512$ pixel image, then
$|\mathcal{X}|=256^{3\times512\times512}$. Hence the standard bound (5) on the
number of errors
$D_{1:\infty}/\ln 2\stackrel{+}{\le}K(\mu)\stackrel{+}{=}K(x_1)\stackrel{+}{=}3\cdot2^{21}$
is huge. Of course, interesting pictures are not purely random, but their
complexity is often only a factor of 10 to 100 less, so still large. On the
other hand, any reasonable prediction scheme observing a few (rather than
several thousand) identical images should predict that the next image will be
the same. This is what our posterior bound gives,
$D_{2:\infty}(x_1)\stackrel{+}{\le}K(\mu|x_1)+K(1)\stackrel{+}{=}0$, hence
indeed $M$ makes only $\sum_{t=1}^\infty\mathbf{E}[s_t]=O(1)$ errors by
Corollary 2(ii), significantly improving upon Solomonoff's bound
$K(\mu)\ln 2$.
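
The figure $3\cdot2^{21}$ in the image example is just the number of bits needed to index one element of the image alphabet. A quick arithmetic check, with variable names chosen here for illustration:

```python
colors_per_pixel = 256 ** 3                 # 3 channels with 256 levels each
pixels = 512 * 512
bits_per_pixel = colors_per_pixel.bit_length() - 1   # log2(256^3) = 24
bits_per_image = pixels * bits_per_pixel             # log2 |X|

print(bits_per_image, 3 * 2 ** 21)
```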

More generally, assume $\alpha=x\omega$, where the initial part $x=x_{1:l}$
contains all information about the remainder, i.e.
$K(\mu|x)\stackrel{+}{=}K(\omega|x)\stackrel{+}{=}0$. For instance, $x$ may be a
binary program for $\pi$ or $e$ and $\omega$ be its $|\mathcal{X}|$-ary
expansion. Surely, given the algorithm for some number sequence, it should be
perfectly predictable. Indeed, Theorem 1 implies
$D_{l+1:\infty}\stackrel{+}{\le}K(l)$, which can be exponentially smaller than
Solomonoff's bound $K(\mu)$ ($\stackrel{+}{=}l$ if
$K(x)\stackrel{+}{=}\ell(x)$). On the other hand,
$K(l)\stackrel{+}{\ge}\log_2 l$ for most $l$, i.e. it is larger than the $O(1)$
one might hope for.

Logarithmic versus constant accuracy. So there is one blemish in the bound. 
There is an additive correction of logarithmic size in the length of x. Many theorems 
in algorithmic information theory hold to within an additive constant, sometimes 
this is easily reached, sometimes hard, sometimes one needs a suitable complexity 
variant, and sometimes the logarithmic accuracy cannot be improved [LV97]. The 
latter is the case with Theorem 1: 

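Lemma 3. For $\mathcal{X}=\{0,1\}$, for any computable measure $\mu$, there
exists a computable sequence $\alpha\in\{0,1\}^\infty$ such that for any
$l\in\mathbb{N}$

$$\sum_{b\in\{0,1\}}\mu(b|\alpha_{<l})\ln\frac{\mu(b|\alpha_{<l})}{M(b|\alpha_{<l})} \stackrel{+}{\ge} \tfrac13 K(l).$$

Proof. Let us construct a computable sequence $\alpha\in\{0,1\}^\infty$ by
induction. Assume that $\alpha_{<l}$ is constructed. Since $\mu$ is a measure,
either $\mu(0|\alpha_{<l})>c$ or $\mu(1|\alpha_{<l})>c$ for
$c:=[3\ln 2]^{-1}<\frac12$. Since $\mu$ is computable, we can find
(effectively) $b\in\{0,1\}$ such that $\mu(b|\alpha_{<l})>c$. Put
$\alpha_l=\bar b$, so that $\mu(\bar\alpha_l|\alpha_{<l})>c$.

Let us estimate $M(\bar\alpha_l|\alpha_{<l})$. Since $\alpha$ is computable,
$M(\alpha_{<l})\stackrel{\times}{\ge}1$. We claim that
$M(\alpha_{<l}\bar\alpha_l)\stackrel{\times}{\le}2^{-K(l)}$. Indeed, consider
the set $\{\alpha_{<l}\bar\alpha_l \mid l>0\}$. This set is prefix free and
decidable. Therefore $P(l)=M(\alpha_{<l}\bar\alpha_l)$ is an enumerable function
with $\sum_l P(l)\le 1$, and the claim follows from the coding theorem. Thus, we
have $M(\bar\alpha_l|\alpha_{<l})\stackrel{\times}{\le}2^{-K(l)}$ for any $l$.
Since $\mu(\bar\alpha_l|\alpha_{<l})>c$, we get

$$\sum_{b\in\{0,1\}}\mu(b|\alpha_{<l})\ln\frac{\mu(b|\alpha_{<l})}{M(b|\alpha_{<l})} \stackrel{+}{\ge} \mu(\bar\alpha_l|\alpha_{<l})\ln\frac{c}{2^{-K(l)}} + \min_{p\in[0,1-c]}p\ln\frac{p}{M(\alpha_l|\alpha_{<l})} \stackrel{+}{\ge} cK(l)\ln 2 .$$

□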
A constant fudge is generally preferable to a logarithmic one for quantitative
and aesthetic reasons. It also often leads to particular insight and/or
interesting new complexity variants (which will be the case here). Though most
complexity variants coincide within logarithmic accuracy (see [Sch00, Sch02a]
for exceptions), they can have very different other properties. For instance,
Solomonoff complexity $KM(x)=-\log_2 M(x)$ is an excellent predictor, but
monotone complexity $Km$ can be exponentially worse and prefix complexity $K$
fails completely [Hut03c].

Exponential bounds. Bayes is often approximated by MAP or MDL. In our
context this means approximating $KM$ by $Km$ with exponentially worse bounds
(in deterministic environments) [Hut03c]. (Intuitively, this is because an error
with Bayes eliminates half of the environments, while MAP/MDL may eliminate only
one.) Also for more complex "reinforcement" learning problems, bounds can be
$2^{K(\mu)}$ rather than $K(\mu)$ due to sparser feedback. For instance, for a
sequence $x_1x_1x_1...$, if we do not observe $x_1$ but only receive a reward if
our prediction was correct, then the only way a universal predictor can find
$x_1$ is by trying out all $|\mathcal{X}|$ possibilities and making (in the
worst case) $|\mathcal{X}|-1\stackrel{\times}{=}2^{K(\mu)}$ errors.
Posterization allows us to boost such gross bounds to useful bounds
$2^{K(\mu|x_1)}=O(1)$. But in general, additive logarithmic corrections as in
Theorem 1 also exponentiate and lead to bounds polynomial in $l$, which may be
quite sizeable. Here the advantage of a constant correction becomes even more
apparent [Hut04, Problems 2.6, 3.13, 6.3 and Section 5.3.3].



5 More Bounds and New Complexity Measure 

Lemma 3 shows that the bound in Theorem 1 is attained for some binary strings.
But for other binary strings the bound may be very rough. (Similarly, $K(x)$ is
greater than $\ell(x)$ infinitely often, but $K(x)\ll\ell(x)$ for many
"interesting" $x$.) Let us try to find a new bound, which does not depend on
$\ell(x)$.

First observe that, in contrast to the unconditional case (5), $K(\mu)$ is not
an upper bound (again by Lemma 3). Informally speaking, the reason is that $M$
can predict the future very badly if the past is not "typical" for the
environment (such past $x$ have low $\mu$-probability, therefore in the
unconditional case their contribution to the expected loss is small). So, it is
natural to bound the loss in terms of the randomness deficiency $d_\mu(x)$,
which is a quantitative measure of "typicalness".

Theorem 4. For any computable measure $\mu$ and any $x,y\in\{0,1\}^*$ it holds

$$\log_2\frac{\mu(y|x)}{M(y|x)} = d_\mu(x)-d_\mu(xy) \stackrel{+}{\le} K(\mu)+K(\lceil d_\mu(x)\rceil).$$

Theorem 4 is a variant of the "deficiency conservation theorem" from [VSU05].
We do not know who was the first to discover this statement and whether it was
published (the special case where $\mu$ is the uniform measure was proved by
An. Muchnik as an auxiliary lemma for one of his unpublished results; then
A. Shen placed a generalized statement in the (unfinished) book [VSU05]).
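
The equality in Theorem 4 is a direct consequence of the definition of the randomness deficiency; only the inequality needs proof. Spelled out, using $\rho(y|x)=\rho(xy)/\rho(x)$ for both $\mu$ and $M$:

```latex
\begin{align*}
d_\mu(x) - d_\mu(xy)
  &= \log_2\frac{M(x)}{\mu(x)} - \log_2\frac{M(xy)}{\mu(xy)} \\
  &= \log_2\frac{\mu(xy)/\mu(x)}{M(xy)/M(x)} \\
  &= \log_2\frac{\mu(y|x)}{M(y|x)} .
\end{align*}
```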
Now, our goal is to replace $K(\mu)$ in the last bound by a conditional
complexity of $\mu$. Unfortunately, the conventional conditional prefix
complexity is not suitable:

Lemma 5. Let $\mathcal{X}=\{0,1\}$. There is a constant $C_0$ such that for any
$l\in\mathbb{N}$, there are a computable measure $\mu$ and $x\in\{0,1\}^l$ such
that

$$K(\mu|x)\le C_0, \qquad d_\mu(x)\le C_0, \qquad\text{and}\qquad \sum_{b\in\{0,1\}}\mu(b|x)\ln\frac{\mu(b|x)}{M(b|x)} \stackrel{+}{\ge} K(l)\ln 2 .$$

Proof. For $l\in\mathbb{N}$, define a deterministic measure $\mu_l$ such that
$\mu_l$ is equal to 1 on the prefixes of $0^l1^\infty$ and is equal to 0
otherwise.

Let $x=0^l$. Then $\mu_l(x)=1$, $\mu_l(x0)=0$, $\mu_l(x1)=1$. Also
$1\ge M(x)\ge M(x0)\ge M(0^\infty)\stackrel{\times}{=}1$ and (as in the proof of
Lemma 3) $M(x1)\stackrel{\times}{\le}2^{-K(l)}$. Trivially,
$d_{\mu_l}(x)=\log_2 M(x)\stackrel{+}{=}0$, and
$K(\mu_l|x)\stackrel{+}{=}K(\mu_l|l)\stackrel{+}{=}0$. Thus, $K(\mu_l|x)$ and
$d_{\mu_l}(x)$ are bounded by a constant $C_0$ independent of $l$. On the other
hand,
$\sum_{b\in\{0,1\}}\mu_l(b|x)\ln\frac{\mu_l(b|x)}{M(b|x)}=\ln\frac{1}{M(1|x)}\stackrel{+}{\ge}K(l)\ln 2$.
(One can obtain the same result also for non-deterministic $\mu$, for example,
taking $\mu_l$ mixed with the uniform measure.) □

Informally speaking, in Lemma 5 we exploit the fact that $K(y|x)$ can use the
information about the length of the condition $x$. Hence $K(y|x)$ can be small
for a certain $x$ and large for some (actually almost all) prolongations of $x$.
But in our case of sequence prediction, the length of $x$ grows, taking all
intermediate values, and cannot contain any relevant information. Thus we need a
new kind of conditional complexity.

Consider a Turing machine $T$ with two input tapes. Inputs are provided without
delimiters, so the size of the input is defined by the machine itself. Let us
call such a machine twice prefix. We write $T(x,y)=z$ if machine $T$, given a
sequence beginning with $x$ on the first tape and a sequence beginning with $y$
on the second tape, halts after reading exactly $x$ and $y$ and prints $z$ to
the output tape. (Obviously, if $T(x,y)=z$, then the computation does not depend
on the contents of the input tapes after $x$ and $y$.) We define
$C_T(y|x):=\min\{\ell(p) \mid \exists k\le\ell(x): T(p,x_{1:k})=y\}$. Clearly,
$C_T(y|x)$ is an enumerable from above function of $T$, $x$, and $y$. Using a
standard argument [LV97], one can show that there exists an optimal twice prefix
machine $U$ in the sense that for any twice prefix machine $T$ we have
$C_U(y|x)\stackrel{+}{\le}C_T(y|x)$.

Definition 6. Complexity monotone in conditions is defined for some fixed
optimal twice prefix machine $U$ as

$$K_*(y|x*) := C_U(y|x) = \min\{\ell(p) \mid \exists k\le\ell(x): U(p,x_{1:k})=y\} .$$

Here $*$ in $x*$ is a syntactical part of the complexity notation, though one
may think of $K_*(y|x*)$ as the minimal length of a program that produces $y$
given any $z=x*$.

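Theorem 7. For any computable measure $\mu$ and any $x,y\in\mathcal{X}^*$ it holds

$$\log_2\frac{\mu(y|x)}{M(y|x)} \stackrel{+}{\le} K_*(\mu|x*) + K(\lceil d_\mu(x)\rceil).$$
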
Note. One can get slightly stronger variants of Theorems 1 and 7 by replacing
the complexity of a standard code of $\mu$ by more sophisticated values. First,
in any effective encoding there are many codes for every $\mu$, and in all the
upper bounds (including Solomonoff's) one can take the minimum of the
complexities of all the codes for $\mu$. Moreover, in Theorem 1 it is sufficient
to take the complexity of $\mu_x=\mu(\cdot|x)$ (and it is sufficient that
$\mu_x$ is enumerable, while $\mu$ can be incomputable). For Theorem 7 one can
prove a similar strengthening: the complexity of $\mu$ is replaced by the
complexity of any computable function that is equal to $\mu$ on all prefixes and
prolongations of $x$.

To demonstrate the usefulness of the new bound, let us again consider some
deterministic environment $\mu=\alpha$. For $\mathcal{X}=\{0,1\}$ and
$\alpha=x^\infty$ with $x=0^n1$, Theorem 1 gives the bound
$K(\mu|n)+K(n)\stackrel{+}{=}K(n)$. Consider the new bound
$K_*(\mu|x*)+K(\lceil d_\mu(x)\rceil)$. Since $\mu$ is deterministic, we have
$d_\mu(x)=\log_2 M(x)\stackrel{+}{=}-K(n)$, and
$K(\lceil d_\mu(x)\rceil)\stackrel{+}{=}K(K(n))$. To estimate $K_*(\mu|x*)$, let
us consider a machine $T$ that reads only its second tape and outputs the number
of 0s before the first 1. Clearly, $C_T(n|x)=0$, hence
$K_*(\mu|x*)\stackrel{+}{=}0$. Finally,
$K_*(\mu|x*)+K(\lceil d_\mu(x)\rceil)\stackrel{+}{\le}K(K(n))$, which is much
smaller than $K(n)$.
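
The machine $T$ from this example is easy to mimic. The sketch below is a toy stand-in, not a real twice prefix Turing machine: it ignores the program tape entirely (so the empty program suffices, matching $C_T(n|x)=0$) and reads the condition symbol by symbol, halting at the first 1. Because it never looks past that 1, its output is the same on every prolongation of $x=0^n1$, which is exactly the consistency that the monotone complexity demands.

```python
def zeros_before_first_one(condition):
    """Toy machine: read the condition tape left to right, halt at the first '1'
    and return the number of '0's read. Returns None if the tape ends before a
    '1' is seen (the machine would not halt on such a condition)."""
    count = 0
    for symbol in condition:
        if symbol == "1":
            return count
        count += 1
    return None

n = 5
x = "0" * n + "1"
prolongations = [x, x + "0", x + "1101", x * 3]   # all begin with x

print([zeros_before_first_one(z) for z in prolongations])
print(zeros_before_first_one("000"))              # proper prefix of x: no halt
```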

6 Properties of the New Complexity 

The above definition of $K_*$ is based on computations of some Turing machine.
Such definitions are quite visual, but are often not convenient for formal
proofs. We will give an alternative definition in terms of enumerable sets (see
[US96] for definitions of unconditional complexities in this style), which
summarizes the properties we actually need for the proof of Theorem 7.

An enumerable set $E$ of triples of strings is called $K_*$-correct if it
satisfies the following requirements:

1. if $\langle p,x,y_1\rangle\in E$ and $\langle p,x,y_2\rangle\in E$, then $y_1=y_2$;

2. if $\langle p,x,y\rangle\in E$, then $\langle p',x',y\rangle\in E$ for all $p'$ being prolongations of $p$ and all $x'$
being prolongations of $x$;

3. if $\langle p,x',y\rangle\in E$ and $\langle p',x,y\rangle\in E$, and $p$ is a prefix of $p'$ and $x$ is a prefix of $x'$,
then $\langle p,x,y\rangle\in E$.

The complexity of $y$ under a condition $x$ w.r.t. a set $E$ is
$C_E(y|x)=\min\{\ell(p) \mid \langle p,x,y\rangle\in E\}$. A $K_*$-correct set
$E$ is called optimal if $C_E(y|x)\stackrel{+}{\le}C_{E'}(y|x)$ for any
$K_*$-correct set $E'$. One can easily construct an enumeration of all
$K_*$-correct sets, and an optimal set exists by the standard argument.

It is easy to see that a twice prefix Turing machine $T$ can be transformed to
a set $E$ such that $C_T(y|x)=C_E(y|x)$. The set $E$ is constructed as follows:
$T$ is run on all possible inputs, and if $T(p,x)=y$, then triples
$\langle p',x',y\rangle$ are added to $E$ for all $p'$ being prolongations of
$p$ and all $x'$ being prolongations of $x$. Evidently, $E$ is enumerable, and
the second requirement of $K_*$-correctness is satisfied. To verify the other
requirements, let us consider arbitrary $\langle p'_1,x'_1,y_1\rangle\in E$ and
$\langle p'_2,x'_2,y_2\rangle\in E$ such that $p'_1$ and $p'_2$, $x'_1$ and
$x'_2$ are comparable (one is a prefix of the other). Then, by construction of
$E$, we have $T(p_1,x_1)=y_1$ and $T(p_2,x_2)=y_2$ for some prefixes $p_i$ of
$p'_i$ and $x_i$ of $x'_i$, and $p_1$ and $p_2$, $x_1$ and $x_2$ are comparable
too. Since replacing the unused part of the inputs does not affect the running
of the machine $T$ and comparable words have a common prolongation, we get
$p_1=p_2$, $x_1=x_2$, and $y_1=y_2$. Thus $E$ is a $K_*$-correct set.
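
The construction of $E$ from halting computations can be sketched for a finite toy case. Below, `generators` is a hypothetical hand-made list of halting computations $T(p_0,x_0)=y_0$ of some imagined twice prefix machine (chosen so that no two of them have both comparable programs and comparable conditions); membership in $E$ is decided by closing under prolongation, and $C_E$ is read off directly. This is only an illustration of the definitions, not an enumeration procedure for a real machine.

```python
# Hypothetical halting computations (program p0, condition x0, output y0).
generators = [
    ("",  "01",  "n1"),
    ("0", "1",   "a"),
    ("1", "1",   "b"),
    ("",  "001", "n2"),
]

def in_E(p, x, y):
    """<p, x, y> is in E iff some generator (p0, x0, y) has p0 a prefix of p
    and x0 a prefix of x (requirement 2 holds by construction)."""
    return any(p.startswith(p0) and x.startswith(x0) and y == y0
               for p0, x0, y0 in generators)

def C_E(y, x):
    """C_E(y|x) = min length of p with <p, x, y> in E; None if there is none."""
    lengths = [len(p0) for p0, x0, y0 in generators
               if x.startswith(x0) and y == y0]
    return min(lengths) if lengths else None

print(C_E("n1", "0"), C_E("n1", "01"), C_E("n1", "0110"))   # condition grows
print(C_E("a", "1"), C_E("a", "10"))
```

Prolonging the condition can only make more generators applicable, so `C_E(y, x)` never increases along prolongations of `x`, mirroring the monotonicity of $K_*$ in its condition.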

The transformation in the other direction is impossible in some cases: the set
$E=\{\langle 0^{h(n)}p,\,0^n1q,\,0\rangle \mid n\in\mathbb{N},\ p,q\in\{0,1\}^*\}$,
where $h(n)$ is 0 if the $n$-th Turing machine halts and 1 otherwise, is
$K_*$-correct, but does not have a corresponding machine $T$: using such a
machine one could solve the halting problem. However, we conjecture that for
every set $E$ there exists a machine $T$ such that
$C_T(x|y)\stackrel{+}{=}C_E(x|y)$.

Probably, the requirements on $E$ can be even weaker, namely, the third
requirement can be superfluous. Let us notice that the first requirement of
$K_*$-correctness allows us to consider the set $E$ as a partial computable
function: $E(p,x)=y$ iff $\langle p,x,y\rangle\in E$. The second requirement
says that $E$ becomes a continuous function if we take the topology of
prolongations (any neighborhood of $\langle p,x\rangle$ contains the cone
$\{\langle p*,x*\rangle\}$) on the arguments and the discrete topology ($\{y\}$
is a neighborhood of $y$) on values. It is known (see [US96] for references)
that different complexities (plain, prefix, decision) can be naturally defined
in a similar "topological" fashion. We conjecture the same is true in our case:
an optimal enumerable set satisfying the requirements (1) and (2) (obviously, it
exists) specifies the same complexity (up to an additive constant) as an optimal
twice prefix machine.

It follows immediately from the definition(s) that $K_*(y|x*)$ is monotone as a
function of $x$: $K_*(y|xz*)\le K_*(y|x*)$ for all $x$, $y$, $z$.

The following lemma provides bounds for $K_*(x|y*)$ in terms of prefix
complexity $K$. The lemma holds for all our definitions of $K_*(x|y*)$.

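Lemma 8. For any $x,y\in\mathcal{X}^*$ it holds

$$K(x|y) \stackrel{+}{\le} K_*(x|y*) \stackrel{+}{\le} \min_{l\le\ell(y)}\{K(x|y_{1:l})+K(l)\} \stackrel{+}{\le} K(x) .$$

In general, none of the bounds is equal to $K_*(x|y*)$ even within an
$o(K(x))$ term, but they are attained for certain $y$: for every $x$ there is a
$y$ such that

$$K(x|y) \stackrel{+}{=} 0 \quad\text{and}\quad K_*(x|y*) \stackrel{+}{=} K(x) \stackrel{+}{=} \min_{l\le\ell(y)}\{K(x|y_{1:l})+K(l)\},$$

and for every $x$ there is a $y$ such that

$$K(x|y) \stackrel{+}{=} K_*(x|y*) \stackrel{+}{=} 0 \quad\text{and}\quad K(x) \stackrel{+}{=} \min_{l\le\ell(y)}\{K(x|y_{1:l})+K(l)\} .$$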

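Corollary 9. The future deviation of $M_t$ from $\mu_t$ is bounded by

$$\sum_{t=l+1}^\infty\mathbf{E}[s_t|\omega_{1:l}] \stackrel{+}{\le} \Big[\min_{i\le l}\{K(\mu|\omega_{1:i})+K(i)\} + K(\lceil d_\mu(\omega_{1:l})\rceil)\Big]\ln 2 .$$

Let us note that if $\omega$ is $\mu$-random, then
$K(\lceil d_\mu(\omega_{1:l})\rceil)\stackrel{+}{\le}K(\lceil d_\mu(\omega_{1:\infty})\rceil)+K(K(\mu))$,
and therefore we get a bound which does not increase with $l$, in contrast to
the bound (i) in Corollary 2.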
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7 Proof of Theorem 7 



The plan is to get a statement of the form $2^d\mu(y) \stackrel{\times}{\leq} M(y)$, where $d \approx d_\mu(x) = \log_2\frac{M(x)}{\mu(x)}$.
To this end, we define a new semimeasure $\nu$: we take the set $S = \{z \mid d_\mu(z) > d\}$ and
put $\nu$ to be $2^d\mu$ on prolongations of $z \in S$; this is possible since $S$ has $\mu$-measure
at most $2^{-d}$. Then we have $\nu(z) \leq C \cdot M(z)$ by universality of $M$. However, the constant $C$
depends on $\mu$ and also on $d$. To make the dependence explicit, we repeat the above
construction for all numbers $d$ and all semimeasures $\mu^T$, obtaining semimeasures $\nu_{d,T}$,
and take $\nu = \sum_{d,T} 2^{-K(d)}2^{-K(T)}\nu_{d,T}$. This construction would give us the term $K(\mu)$
in the right-hand side of Theorem 7. To get $K_*(\mu|x*)$, we need a more complicated
strategy: instead of a sum of semimeasures $\nu_{d,T}$, for every fixed $d$ we sum "pieces"
of $\nu_{d,T}$ at each point $z$, with coefficients depending on $z$ and $T$.
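
The measure bound used in the plan above can be checked in one line. The following is a sketch under the assumption that $d_\mu(z) = \log_2(M(z)/\mu(z))$; the symbols $S_{\min}$ (the prefix-free set of minimal elements of $S$) and $\Gamma_z$ (the set of infinite sequences starting with $z$) are introduced here only for illustration and are not part of the original notation.

```latex
\mu\Bigl(\bigcup_{z\in S}\Gamma_z\Bigr)
  \;=\; \sum_{z\in S_{\min}} \mu(z)
  \;\le\; \sum_{z\in S_{\min}} 2^{-d} M(z)
  \;\le\; 2^{-d}
```

The middle step uses $d_\mu(z) > d$, i.e. $\mu(z) < 2^{-d}M(z)$, for every $z\in S$; the last step uses that a semimeasure sums to at most 1 over any prefix-free set. Hence $2^d\mu$ restricted to prolongations of $S$ has total mass at most 1.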

We now proceed with the formal proof. Let $\{\mu^T\}_{T\in\mathbb{N}}$ be any (effective) enumeration
of all enumerable semimeasures. For any integer $d$ and any $T$, put

$$S_{d,T} := \Bigl\{z \;\Big|\; \sum_{v\in X^{\ell(z)}\setminus\{z\}} \mu^T(v) + 2^{-d}M(z) > 1\Bigr\} .$$

The set $S_{d,T}$ is enumerable given $d$ and $T$.

Let $E$ be the optimal $K_*$-correct set (satisfying all three requirements), and let $E(p,z)$
be the corresponding partial computable function. For any $z\in X^*$ and $T$, put

$$\lambda(z,T) := \max\{2^{-\ell(p)} \mid \exists k \leq \ell(z) : z_{1:k} \in S_{d,T} \text{ and } E(p,z_{1:k}) = T\}$$

(if there is no such $p$, then $\lambda(z,T) = 0$). Put

$$\tilde\nu_d(z) := \sum_T \lambda(z,T)\cdot 2^d\mu^T(z) .$$

Obviously, this value is enumerable. It is not a semimeasure, but it has the following 
property (we omit the proof). 

Claim 10. For any prefix-free set $A$,

$$\sum_{z\in A} \tilde\nu_d(z) \leq 1 .$$

This implies that there exists an enumerable semimeasure $\nu_d$ such that $\nu_d(z) \geq
\tilde\nu_d(z)$ for all $z$. Indeed, to enumerate $\nu_d$, one enumerates $\tilde\nu_d(z)$ for all $z$ and at
each step sets the current value of $\nu_d(z)$ to the maximum of the current values of
$\tilde\nu_d(z)$ and $\sum_{u\in X}\nu_d(zu)$. Trivially, this provides $\nu_d(z) \geq \sum_{u\in X}\nu_d(zu)$. To show that
$\nu_d(\epsilon) \leq 1$, let us note that at any step of enumeration the current value of $\nu_d(\epsilon)$ is
the sum of current values $\tilde\nu_d(z)$ over some prefix-free set, and thus is bounded by 1.
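
The step from the non-semimeasure to a dominating semimeasure can be illustrated on a finite tree. The following is a minimal sketch, not the construction from the proof: it assumes a binary alphabet, truncates strings at a fixed depth, and replaces the step-by-step enumeration (where values are only approached from below) by a single bottom-up pass over final values. The names `dominate` and `tilde` and the toy weights are hypothetical.

```python
from itertools import product

def dominate(tilde, depth, alphabet="01"):
    """Return nu with nu[z] = max(tilde[z], sum_u nu[z+u]) for all len(z) <= depth.

    `tilde` maps strings to nonnegative weights (missing strings count as 0).
    Strings longer than `depth` are treated as having weight 0 (an assumption
    of this finite sketch).
    """
    nu = {}
    for n in range(depth, -1, -1):              # longest strings first
        for tup in product(alphabet, repeat=n):
            z = "".join(tup)
            children = sum(nu.get(z + u, 0.0) for u in alphabet)
            nu[z] = max(tilde.get(z, 0.0), children)
    return nu

# Toy weights: the sum over every prefix-free set is at most 1,
# but the weights are not superadditive (e.g. tilde[""] < tilde["0"] + tilde["1"]).
tilde = {"": 0.1, "0": 0.5, "1": 0.2,
         "00": 0.3, "01": 0.1, "10": 0.25, "11": 0.25}
nu = dominate(tilde, depth=2)
```

In this toy instance `nu` dominates `tilde` pointwise, satisfies the semimeasure inequality at every node, and its value at the empty string is the sum of `tilde` over the prefix-free set {"0", "10", "11"}, which is why it stays bounded by 1.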
Put

$$\nu(z) := \sum_d 2^{-K(d)}\nu_d(z) .$$

Clearly, $\nu$ is an enumerable semimeasure, thus $\nu(z) \stackrel{\times}{\leq} M(z)$. Let $\mu$ be an arbitrary
computable measure, and $x,y\in X^*$. Let $p\in\{0,1\}^*$ be a string such that $K_*(\mu|x*) =
\ell(p)$, $E(p,x) = T$, and $\mu = \mu^T$. Put $d = \lceil d_\mu(x)\rceil - 1$, i.e., $d_\mu(x) - 1 \leq d < d_\mu(x)$. Hence
$\mu(x) < 2^{-d}M(x)$. Since $\mu = \mu^T$ is a measure, we have $\sum_{v\in X^{\ell(x)}}\mu(v) = 1$, and therefore
$x \in S_{d,T}$. By definition, $\lambda(xy,T) \geq 2^{-\ell(p)}$, thus $\nu_d(xy) \geq \tilde\nu_d(xy) \geq 2^{-\ell(p)}2^d\mu(xy)$, and

$$2^{-K(d)}2^{-\ell(p)}2^d\mu(xy) \leq \nu(xy) \stackrel{\times}{\leq} M(xy) .$$

After trivial transformations we get 

$$\log_2\frac{\mu(y|x)}{M(y|x)} \stackrel{+}{\leq} K_*(\mu|x*) + K(\lceil d_\mu(x)\rceil) ,$$

which completes the proof of Theorem 7. 
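
The "trivial transformations" can be spelled out as follows. This is a reconstruction from the quantities used in the proof, under the assumptions $d = \lceil d_\mu(x)\rceil - 1$, $d_\mu(x) = \log_2(M(x)/\mu(x))$ and $\ell(p) = K_*(\mu|x*)$; it is a sketch rather than a quotation of the original derivation.

```latex
\begin{align*}
2^{d} &\ge 2^{d_\mu(x)-1} = \frac{M(x)}{2\,\mu(x)}
  && \text{since } d \ge d_\mu(x)-1,\\
2^{-K(d)}\,2^{-\ell(p)}\,\frac{M(x)}{2\,\mu(x)}\,\mu(xy)
  &\le 2^{-K(d)}\,2^{-\ell(p)}\,2^{d}\mu(xy)
   \stackrel{\times}{\le} M(xy),\\
\frac{\mu(y|x)}{M(y|x)} = \frac{\mu(xy)\,M(x)}{\mu(x)\,M(xy)}
  &\stackrel{\times}{\le} 2^{\ell(p)+K(d)},\\
\log_2\frac{\mu(y|x)}{M(y|x)}
  &\stackrel{+}{\le} K_*(\mu|x*) + K(d)
   \stackrel{+}{=} K_*(\mu|x*) + K(\lceil d_\mu(x)\rceil).
\end{align*}
```

The last equality holds because $d$ and $\lceil d_\mu(x)\rceil$ differ by 1, so their prefix complexities agree up to an additive constant.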

8 Discussion 

Conclusion. We evaluated the quality of predicting a stochastic sequence at an
intermediate time, when some beginning of the sequence has already been observed,
estimating the future loss of the universal Solomonoff predictor $M$. We proved
general upper bounds for the discrepancy between conditional values of the predictor
$M$ and the true environment $\mu$, and demonstrated a kind of tightness for these
bounds. One of the bounds is based on a new variant of conditional algorithmic
complexity $K_*$, which has interesting properties in its own right. In contrast to standard
prefix complexity $K$, $K_*$ is a monotone function of conditions: $K_*(y|xz*) \leq K_*(y|x*)$.

General Bayesian posterior bounds. A natural question is whether posterior
bounds for general Bayes mixtures based on general $\mathcal{M}\ni\mu$ could also be derived.
From the (obvious) posterior representation $\xi(y|x) = \sum_{\nu\in\mathcal{M}} w_\nu(x)\nu(y|x) \geq
w_\mu(x)\mu(y|x)$, where $w_\nu(x) := w_\nu\frac{\nu(x)}{\xi(x)}$ is the posterior belief in $\nu$ after observing $x$, the
bound $D_{l:\infty} \leq \ln w_\mu(\omega_{<l})^{-1}$ immediately follows. Strangely enough, for $\mathcal{M} = \mathcal{M}_U$,
$\log_2 w_\nu^{-1} := K(\nu)$ does not imply $\log_2 w_\mu(x)^{-1} \stackrel{+}{=} K(\mu|x)$, not even within logarithmic
accuracy, so it was essential to consider $D_{l:\infty}$. It would be interesting to derive
bounds on $D_{l:\infty}$ or $\ln w_\mu(x)^{-1}$ for general $\mathcal{M}$ similar to the ones derived here for
$\mathcal{M} = \mathcal{M}_U$.
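
The posterior representation and the resulting bound can be checked numerically on a small model class. The following is a minimal sketch under loud assumptions: the class consists of three i.i.d. Bernoulli models with hypothetical prior weights, the true environment is one of them, and the check is done pointwise for one continuation $y$ (the bound on the expected deviation follows by taking expectations over $y$). All names are illustrative and not taken from the paper.

```python
import math

def seq_prob(theta, x):
    """Probability of the binary string x under i.i.d. Bernoulli(theta)."""
    ones = x.count("1")
    return theta ** ones * (1 - theta) ** (len(x) - ones)

def mixture(prior, x):
    """Bayes mixture xi(x) = sum_nu w_nu * nu(x) over the finite class."""
    return sum(w * seq_prob(theta, x) for theta, w in prior.items())

def posterior(prior, x):
    """Posterior weights w_nu(x) = w_nu * nu(x) / xi(x)."""
    xi_x = mixture(prior, x)
    return {theta: w * seq_prob(theta, x) / xi_x for theta, w in prior.items()}

prior = {0.2: 0.25, 0.5: 0.5, 0.8: 0.25}   # hypothetical prior weights w_nu
mu = 0.8                                    # hypothetical true environment
x, y = "1101", "11"                         # observed prefix and one continuation

post = posterior(prior, x)
xi_cond = mixture(prior, x + y) / mixture(prior, x)    # xi(y|x)
mu_cond = seq_prob(mu, x + y) / seq_prob(mu, x)        # mu(y|x)
log_ratio = math.log(mu_cond / xi_cond)                # ln mu(y|x)/xi(y|x)
bound = math.log(1 / post[mu])                         # ln w_mu(x)^{-1}
```

Here `log_ratio <= bound` holds because $\xi(y|x)$ is a posterior-weighted average that includes the term $w_\mu(x)\mu(y|x)$; the sketch says nothing about how the posterior weight relates to $K(\mu|x)$ in the universal class.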

Online classification. All considered distributions $\rho(x)$ (in particular $\xi$, $M$, and
$\mu$) may be replaced everywhere by distributions $\rho(x|z)$ additionally conditioned
on some $z$. The $z$-conditions cause no problems anywhere, as they can essentially be
thought of as fixed (or as oracles or spectators). An (i.i.d.) classification problem
is a typical example: At time $t$ one arranges an experiment $z_t$ (or observes data
$z_t$), then tries to make a prediction, and finally observes the true outcome $x_t$ with
probability $\mu(x_t|z_t)$. In this case $\mathcal{M} = \{\nu(x_{1:n}|z_{1:n}) = \nu(x_1|z_1)\cdot\ldots\cdot\nu(x_n|z_n)\}$. (Note
that $\xi$ is not i.i.d.) Solomonoff's bound $K(\mu)\ln 2$ (5) holds unchanged. Compared to
the sequence prediction case we have extra information $z$, so we may wonder whether
some improved bound, $K(\mu|z)$ or so, holds. For a fixed $z$ this can be achieved by
also replacing $2^{-K(\nu)}$ in (2) by $2^{-K(\nu|z)}$. But if at time $t$ only $z_{1:t}$ is known, as in the
classification example, this leads to difficulties ($\xi$ is no longer a (semi)measure, which
sometimes can be corrected [PH04]). Alternatively we could keep definition (2) but
apply it to the (chronologically correctly ordered) sequence $z_1x_1z_2x_2\ldots$, condition on
$z_{1:t}$, and try to derive improved bounds.

More open problems. Since $D_{1:\infty}$ is finite, one may expect that the tails $D_{l:\infty}$ tend
to 0 as $l\to\infty$. However, as Lemma 3 implies, this holds only with probability 1: for
some special $\alpha$ we even have $D_{l:\infty}(\alpha_{<l})$ bounded below by a constant multiple of
$K(l)\to\infty$. It would be very interesting
to find a wide class of $\alpha$ such that $D_{l:\infty}(\alpha_{<l}) \to 0$. The natural conjecture is that
one should take $\mu$-random $\alpha$. Another (probably closely related) task is to study
the asymptotic behavior of $K_*(\mu|\alpha_{<l}*)$. It is natural to expect that $K_*(\mu|\alpha_{<l}*)$ is
bounded by an absolute constant (independent of $\mu$) for "most" $\alpha$ and for sufficiently
large $l$. Finally, (dis)proving equality of the various definitions of $K_*$ we gave would
be useful.